About this project

This project is about learning data science using R.

GitHub

More information can be found in my GitHub repository.


RStudio Exercise 2: Regression and model validation

Data Wrangling

See file learning2014.r in my GitHub repository

Data analysis for RStudio Exercise 2

Install required packages

First, we install the required packages for the analysis: ggplot2, GGally, and gridExtra (for arranging multiple plots)

install.packages("ggplot2", repos = "http://cloud.r-project.org")
install.packages("GGally", repos = "http://cloud.r-project.org")
install.packages("gridExtra", repos = "http://cloud.r-project.org")

Read data from the web:

lrn14 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", sep=",", header=TRUE)

Assess the data structure:

summary(lrn14)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

The dataset is used to study the relationship between learning approaches and students’ achievements in an introductory statistics course in Finland. It consists of 166 observations of 7 variables. The sample consists of 110 females and 56 males. The respondents’ age ranges between 17 and 55. The median age is 22.

str(lrn14)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
  1. gender describes the sex of the respondent.
  2. age describes the age of the respondent.
  3. attitude is a composite variable that measures the respondents’ attitude towards learning statistics.
  4. deep is a composite variable that measures the respondents’ inclination towards deep learning approaches.
  5. stra is a composite variable that measures the respondents’ inclination towards strategic learning approaches.
  6. surf is a composite variable that measures the respondents’ inclination towards surface learning approaches.
  7. points gives the exam points of the respondent.
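The composite variables are averages of several related Likert-scale questions. As a toy sketch of how such a composite can be formed with rowMeans() (the item names d1, d2, d3 are hypothetical placeholders, not the actual survey items):

```r
# Toy example: a composite score is the row mean of related Likert items.
# The item names d1, d2, d3 are hypothetical placeholders.
items <- data.frame(d1 = c(4, 2), d2 = c(5, 3), d3 = c(3, 4))
items$deep <- rowMeans(items[, c("d1", "d2", "d3")])
items$deep  # 4 3
```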

Summary statistics of the learning data

First, we load the required libraries (GGally, ggplot2, gridExtra)

library(ggplot2)
library(GGally)
library(gridExtra)

Plot summary of learning data with ggpairs()

Overview of the learning data:

ggpairs(lrn14, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20))) + ggtitle("Learning 2014: Summary Statistics")

Next, we plot each learning approach against exam points to assess the correlation between the variables.

# initialize plots with data and aesthetic mapping

# attitude vs points
p1 <- ggplot(lrn14, aes(x = attitude, y=points, col=gender)) + geom_point() + ggtitle("Attitude vs points") + theme(legend.position="bottom")

# stra vs points
p2 <- ggplot(lrn14, aes(x = stra, y=points, col=gender)) + geom_point() + ggtitle("Strategic learning vs points")

# surf vs points
p3 <- ggplot(lrn14, aes(x = surf, y=points, col=gender)) + geom_point() + ggtitle("Surface learning vs points")

# deep vs points
p4 <- ggplot(lrn14, aes(x = deep, y=points, col=gender)) + geom_point() + ggtitle("Deep learning vs points")

# create common legend
g_legend<-function(a.gplot){
  tmp <- ggplot_gtable(ggplot_build(a.gplot))
  leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
  legend <- tmp$grobs[[leg]]
  return(legend)}

mylegend<-g_legend(p1)

# multiple plot
grid.arrange(arrangeGrob(p1 + theme(legend.position="none"),
                               p2 + theme(legend.position="none"),
                               p3 + theme(legend.position="none"),
                               p4 + theme(legend.position="none"),
                               nrow=2),
                   mylegend, nrow=2,heights=c(10,1))

Attitude and study achievement appear to be correlated, whereas there does not appear to be a correlation between study achievement and the different learning approaches.
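The visual impression can be checked numerically by computing the correlation of each learning variable with the exam points (a quick check using the lrn14 data loaded above):

```r
# correlation of each learning variable with exam points
sapply(lrn14[, c("attitude", "deep", "stra", "surf")],
       function(x) cor(x, lrn14$points))
```

attitude should show the clearly strongest correlation with points, in line with the scatter plots.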

Next, we plot histograms of each variable by gender:

p6 <- ggplot(lrn14, aes(x=attitude, fill=gender, color=gender)) +
  geom_histogram(position="identity", alpha=0.5) 

p7 <- ggplot(lrn14, aes(x=deep, fill=gender, color=gender)) +
  geom_histogram(position="identity", alpha=0.5)

p8 <- ggplot(lrn14, aes(x=stra, fill=gender, color=gender)) +
  geom_histogram(position="identity", alpha=0.5)

p9 <- ggplot(lrn14, aes(x=surf, fill=gender, color=gender)) +
  geom_histogram(position="identity", alpha=0.5)

p10 <- ggplot(lrn14, aes(x=points, fill=gender, color=gender)) +
  geom_histogram(position="identity", alpha=0.5)


grid.arrange(arrangeGrob(p6 + theme(legend.position="none"),
                         p7 + theme(legend.position="none"),
                         p8 + theme(legend.position="none"),
                         p9 + theme(legend.position="none"),
                         p10 + theme(legend.position="none"),
                         nrow=3),
             mylegend, nrow=2,heights=c(10,1), top="Learning 2014: Frequencies")

Males appear to have a generally more positive attitude towards statistics than females, but there do not appear to be other clear differences between the groups.

Regression analysis

Next, we move to regression analysis to determine which variables affect study achievement (points).

  • Model 1:
  • Dependent variable: points
  • Explanatory variables: attitude, stra, surf
my_model1 <- lm(points ~ attitude + stra + surf, data = lrn14)
summary(my_model1)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The variables stra and surf are not statistically significant. We drop surf as an explanatory variable, because its p-value is the largest.
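Dropping a term can also be done with update(), which refits the model with a modified formula; a sketch, equivalent to spelling out the reduced lm() call:

```r
# refit model 1 without surf; equivalent to
# lm(points ~ attitude + stra, data = lrn14)
my_model2 <- update(my_model1, . ~ . - surf)
```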

  • Model 2:
  • Dependent variable: points
  • Explanatory variables: attitude, stra
my_model2 <- lm(points ~ attitude + stra, data = lrn14)
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

The variable stra is not statistically significant at the 5% level. We keep attitude as the only explanatory variable (with intercept), because it is statistically significant at the 5% level.

  • Model 3:
  • Dependent variable: points
  • Explanatory variables: attitude
my_model3 <- lm(points ~ attitude, data = lrn14)
summary(my_model3)
## 
## Call:
## lm(formula = points ~ attitude, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The explanatory variable attitude and the intercept are statistically significant. Model 3 may be interpreted as follows: a one-point increase in attitude is associated with an increase of about 3.5 exam points.
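To illustrate the interpretation, we can compare the model's predicted points at two attitude levels one unit apart; their difference equals the attitude coefficient:

```r
# predicted exam points at attitude = 3 and attitude = 4
preds <- predict(my_model3, newdata = data.frame(attitude = c(3, 4)))
diff(preds)  # equals the attitude coefficient, about 3.5
```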

Model validation

par(mfrow = c(1,3))
plot(my_model3, which=c(1,2,5))

The residuals appear to be randomly scattered and centered around zero, which suggests that the model satisfies the assumption that the errors have constant variance.

The Q-Q plot shows that the relationship between the sample quantiles and the theoretical quantiles of the normal distribution is roughly linear, which suggests that the model satisfies the assumption that the errors are normally distributed.

The residuals-vs-leverage plot shows that there are no outliers in the data that significantly affect the regression results.


RStudio Exercise 3: Logistic regression

Data Wrangling

See file create_alc.R in my GitHub repository

Data analysis for RStudio Exercise 3

Install required packages

First, we install the required packages for the analysis: tidyr, dplyr, ggplot2, GGally, gridExtra (for multiple plots)

install.packages("tidyr", repos = "http://cloud.r-project.org")
install.packages("dplyr", repos = "http://cloud.r-project.org")
install.packages("ggplot2", repos = "http://cloud.r-project.org")
install.packages("GGally", repos = "http://cloud.r-project.org")
install.packages("gridExtra", repos = "http://cloud.r-project.org")

and make the packages available

library(tidyr); library(dplyr); library(ggplot2); library(gridExtra)

Read data from data folder:

alc <- read.table("C:/Users/palme/Documents/IODS-project/data/alc.csv", sep=",", header=TRUE)

Glimpse at the alc data

glimpse(alc) 
## Observations: 382
## Variables: 35
## $ school     <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,...
## $ sex        <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, ...
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address    <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, ...
## $ famsize    <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, G...
## $ Pstatus    <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, ...
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob       <fct> at_home, at_home, at_home, health, other, services,...
## $ Fjob       <fct> teacher, other, other, services, other, other, othe...
## $ reason     <fct> course, course, other, home, home, reputation, home...
## $ nursery    <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, ye...
## $ internet   <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes,...
## $ guardian   <fct> mother, father, mother, mother, father, mother, mot...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures   <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup  <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, ...
## $ famsup     <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes,...
## $ paid       <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, ...
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes,...
## $ higher     <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ romantic   <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no...
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences   <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1         <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2         <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3         <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...

Dataset information:

This dataset contains measures of student achievement in two Portuguese schools. The data include student grades as well as demographic, social, and school-related features. The dataset provides average performance in two distinct subjects: mathematics and Portuguese language.

The dataset consists of 35 variables and 382 observations.

Variables:

  1. school - student’s school
  2. sex - student’s sex
  3. age - student’s age
  4. address - urban or rural
  5. famsize - family size
  6. Pstatus - parent’s cohabitation status
  7. Medu - mother’s education
  8. Fedu - father’s education
  9. Mjob - mother’s job
  10. Fjob - father’s job
  11. reason - reason to choose this school
  12. guardian - student’s guardian
  13. traveltime - home to school travel time
  14. studytime - weekly study time
  15. failures - number of past class failures
  16. schoolsup - extra educational support
  17. famsup - family educational support
  18. paid - extra paid classes
  19. activities - extra-curricular activities
  20. nursery - attended nursery school
  21. higher - wants to take higher education
  22. internet - Internet access at home
  23. romantic - with a romantic relationship
  24. famrel - quality of family relationships
  25. freetime - free time after school
  26. goout - going out with friends
  27. Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
  28. Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
  29. health - health status
  30. absences - number of absences
  31. G1 - first period average grade (math and portuegese)
  32. G2 - second period average grade (math and portuegese)
  33. G3 - final period average grade (math and portuegese)
  34. alc_use - average of Dalc and Walc
  35. high_use - TRUE if alc_use is higher than 2, and FALSE otherwise

Analysis of the causes of high alcohol consumption

First, we analyze the data by drawing a bar plot of each variable:

gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()

Main hypothesis

There is quite a lot of variation in the respondents’ free time and study time, so we use these variables, along with age and sex, as explanatory variables for alcohol consumption. The main hypothesis is that respondents with a lot of free time also have time to consume alcohol. Moreover, respondents who report many hours of study may choose not to consume alcohol, which might otherwise impair their ability to study. Age might also be positively correlated with alcohol consumption, as older respondents may drink more than their younger counterparts. Finally, there might be gender differences in alcohol consumption.

We study the relationship between high_use and studytime, age, freetime, and sex. First, we produce summary statistics by group:

alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean(studytime), mean(age), mean(freetime))
## # A tibble: 4 x 6
## # Groups:   sex [2]
##   sex   high_use count `mean(studytime)` `mean(age)` `mean(freetime)`
##   <fct> <lgl>    <int>             <dbl>       <dbl>            <dbl>
## 1 F     FALSE      156              2.34        16.6             2.93
## 2 F     TRUE        42              2           16.5             3.36
## 3 M     FALSE      112              1.88        16.3             3.39
## 4 M     TRUE        72              1.64        17.0             3.5

Graphical illustration of data:

g1 <- ggplot(alc, aes(x = high_use, y = age, col = sex)) #plot high_use and age, separate by sex
g1 + geom_boxplot() + ggtitle("Age by alcohol consumption and sex") # define the plot as a boxplot and draw it

g2 <- ggplot(alc, aes(x = high_use, y = studytime, col = sex)) # plot high_use and freetime by sex

g2 + geom_boxplot() + ggtitle("Study time by alcohol consumption and sex") # define the plot as a boxplot and draw it

g3 <- ggplot(alc, aes(x = high_use, y = freetime, col = sex)) #plot of high_use and freetime

g3 + geom_boxplot() + ggtitle("Free time by alcohol consumption and sex") # define the plot as a boxplot and draw it

Main findings:

  • The mean age of males who report high alcohol use is higher than that of males who report low-level drinking.
  • The mean age of females who report high alcohol use is slightly lower than that of females who consume less alcohol, although the mean ages of the two groups are almost the same.
  • Males and females who report high alcohol use on average have more free time than those who consume less alcohol.
  • Males and females who report high alcohol use on average study less than those who consume less alcohol.
  • A higher share of males than females report high alcohol use (0.39 vs. 0.21).
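The shares of high use by sex can be computed directly from the data (the mean of a logical vector gives the proportion of TRUE values):

```r
# proportion reporting high alcohol use, by sex
alc %>% group_by(sex) %>% summarise(share_high = mean(high_use))
```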

Logistic regression

Run a logistic regression with high_use as the dependent variable and studytime, age, freetime, and sex as explanatory variables:

m <- glm(high_use ~ age + studytime + freetime + sex, data = alc, family = "binomial")

The summary of the model

summary(m)
## 
## Call:
## glm(formula = high_use ~ age + studytime + freetime + sex, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6526  -0.8509  -0.6602   1.1301   2.2095  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -4.8768     1.7790  -2.741  0.00612 **
## age           0.2290     0.1011   2.265  0.02349 * 
## studytime    -0.4697     0.1602  -2.931  0.00337 **
## freetime      0.2503     0.1229   2.037  0.04166 * 
## sexM          0.5854     0.2482   2.359  0.01833 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 431.41  on 377  degrees of freedom
## AIC: 441.41
## 
## Number of Fisher Scoring iterations: 4

All coefficients are statistically significant at the 5% level, suggesting that the chosen variables explain alcohol consumption.

Compute the Odds Ratios (OR)

OR <- coef(m) %>% exp

and confidence intervals

CI <- confint(m) %>% exp

The Odds Ratios and confidence intervals are provided in the table below:

cbind(OR, CI)
##                      OR        2.5 %    97.5 %
## (Intercept) 0.007621647 0.0002191106 0.2381440
## age         1.257384089 1.0338141044 1.5381420
## studytime   0.625158887 0.4522877516 0.8490039
## freetime    1.284393173 1.0119391550 1.6399589
## sexM        1.795640894 1.1056484552 2.9303141

The odds ratios can be interpreted as the multiplicative change in the odds of high alcohol use given a one-unit increase in the explanatory variable. For example, the regression coefficient for age is 0.2290, which indicates that a one-unit increase in age multiplies the odds of high alcohol use by exp(0.2290) ≈ 1.26.

To summarize, the amount of free time, age, and being male increase the odds of high alcohol consumption, whereas the amount of study time decreases the odds of high alcohol use, as hypothesized above.
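As a quick significance check on the odds-ratio scale, an effect is significant at the 5% level when its confidence interval lies entirely above or entirely below 1:

```r
# TRUE when the 95% CI for the odds ratio excludes 1
apply(CI, 1, function(ci) ci[1] > 1 | ci[2] < 1)
```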

Exploring the predictive power of the model

First, we compute the predicted probabilities and turn them into class predictions

probabilities <- predict(m, type = "response") # predict() the probability of high_use
alc <- mutate(alc, probability = probabilities) # add the predicted probabilities to 'alc'
alc <- mutate(alc, prediction = probability > 0.5) # use the probabilities to make a prediction of high_use

and check the predictions for the last 10 observations

select(alc, studytime, freetime, age, sex, high_use, probability, prediction) %>% tail(10) # see the last ten original classes, predicted probabilities, and class predictions
##     studytime freetime age sex high_use probability prediction
## 373         1        3  19   M    FALSE   0.5845170       TRUE
## 374         1        4  18   M     TRUE   0.5896690       TRUE
## 375         3        3  18   F    FALSE   0.1958322      FALSE
## 376         1        4  18   F    FALSE   0.4445380      FALSE
## 377         3        4  19   F    FALSE   0.2822698      FALSE
## 378         2        3  18   F    FALSE   0.2803350      FALSE
## 379         2        2  18   F    FALSE   0.2327073      FALSE
## 380         2        1  18   F    FALSE   0.1910235      FALSE
## 381         1        4  17   M     TRUE   0.5333414       TRUE
## 382         1        4  18   M     TRUE   0.5896690       TRUE

Predictions vs actual values

table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   255   13
##    TRUE     95   19
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction)) # plot 'high_use' versus 'probability' in 'alc'
g + geom_point() 

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.66753927 0.03403141 0.70157068
##    TRUE  0.24869110 0.04973822 0.29842932
##    Sum   0.91623037 0.08376963 1.00000000
#define a loss function (average prediction error)
  loss_func <- function(class, prob) {
    n_wrong <- abs(class - prob) > 0.5
    mean(n_wrong)
  }

There is a high number of false negatives, i.e., cases where the model predicts low alcohol use even though actual use is high, whereas the number of false positives is quite low. This suggests that the model's ability to detect high use is limited; it could be adjusted by lowering the classification threshold (here 0.5).
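This can be quantified with the sensitivity (share of true high users detected) and specificity (share of low users correctly classified), computed from the confusion matrix above:

```r
# sensitivity and specificity from the 2x2 confusion matrix
tab <- table(high_use = alc$high_use, prediction = alc$prediction)
c(sensitivity = tab["TRUE", "TRUE"] / sum(tab["TRUE", ]),
  specificity = tab["FALSE", "FALSE"] / sum(tab["FALSE", ]))
```

With the counts above, this gives a sensitivity of only about 0.17 but a specificity of about 0.95.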

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2827225

On average, the model makes wrong predictions about 28% of the time (0.2827), which is better than the average share of wrong predictions given random guessing:

# compute the average number of wrong predictions in the (training) data using random predictions
loss_func(class = alc$high_use, prob = runif(length(alc$high_use))>0.5)
## [1] 0.5340314

As the number of predictions grows, the average share of wrong predictions from random guessing should approach 0.5.
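We can illustrate this by averaging the loss over many rounds of random guessing:

```r
# average loss over 1000 rounds of random guessing
set.seed(2014)
mean(replicate(1000,
  loss_func(class = alc$high_use,
            prob = runif(length(alc$high_use)) > 0.5)))
# close to 0.5
```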

Perform K-fold cross-validation

library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

The average number of wrong predictions in the cross-validation is 0.2958115, which is higher than that of the model in DataCamp. A better model could probably be found using more or other explanatory variables.

Such a model could be one where high_use is explained by studytime, sex, and goout.

m2 <- glm(high_use ~ studytime + sex + goout, data = alc, family = "binomial")

summary(m2)
## 
## Call:
## glm(formula = high_use ~ studytime + sex + goout, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7187  -0.8058  -0.5224   0.8849   2.6300  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.7318     0.5683  -4.807 1.53e-06 ***
## studytime    -0.4822     0.1683  -2.865  0.00417 ** 
## sexM          0.6722     0.2581   2.605  0.00919 ** 
## goout         0.7519     0.1186   6.340 2.30e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 393.95  on 378  degrees of freedom
## AIC: 401.95
## 
## Number of Fisher Scoring iterations: 4
cv2 <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K = 10)

# average number of wrong predictions in the cross validation
cv2$delta[1]
## [1] 0.2460733


RStudio Exercise 4: Clustering and classification

Housekeeping

First, install required packages:

install.packages('plotly', repos = "http://cloud.r-project.org")
install.packages('dplyr', repos = "http://cloud.r-project.org")
install.packages('ggplot2', repos = "http://cloud.r-project.org")
install.packages('GGally', repos = "http://cloud.r-project.org")
install.packages("corrplot", repos = "http://cloud.r-project.org")

Access the required libraries:

library(dplyr)
library(MASS)
library(ggplot2)
library(GGally)

Summary of the data

For this exercise, we use the Boston dataset, which contains information on housing in Boston, Massachusetts. For more information, see the dataset's documentation in the MASS package.

We start out by loading the data from the MASS package.

# load the data
data("Boston")

Explore the dataset Boston:

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

The dataset contains 506 observations of 14 variables.

  1. CRIM - per capita crime rate by town
  2. ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS - proportion of non-retail business acres per town.
  4. CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
  5. NOX - nitric oxides concentration (parts per 10 million)
  6. RM - average number of rooms per dwelling
  7. AGE - proportion of owner-occupied units built prior to 1940
  8. DIS - weighted distances to five Boston employment centres
  9. RAD - index of accessibility to radial highways
  10. TAX - full-value property-tax rate per $10,000
  11. PTRATIO - pupil-teacher ratio by town
  12. BLACK - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  13. LSTAT - % lower status of the population
  14. MEDV - Median value of owner-occupied homes in $1000’s
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

We notice that the variable chas is a dummy variable (either 0 or 1), whereas the other variables are continuous and have varying means and variances.

We can see this more clearly by visualizing the data.

Visualize the data

# use ggplot to explore the data
#ggplot(Boston, aes(x=zn)) + geom_histogram(binwidth=50)

ggpairs(Boston, lower = list(combo = wrap("facethist", bins = 20)))

We can also plot the correlation between variables:

library(corrplot)
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(digits = 2)
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)

There appears to be a strong correlation between crime (crim) and both the pupil-teacher ratio (ptratio) and accessibility to radial highways (rad).

Standardize the data for linear discriminant analysis (LDA)

boston_scaled <- scale(Boston) # center and standardize variables

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

Looking at the scaled data, we can see that each variable is now centered on zero (the mean is zero by definition) and has a standard deviation of one, whereas the means and scales varied in the original data.
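The scale() function performs this standardization column by column: it subtracts each column's mean and divides by its standard deviation. A minimal sketch with a toy vector (illustrative values, not the Boston data):

```r
# standardization by hand: (x - mean(x)) / sd(x)
x <- c(2, 4, 6, 8, 10)
manual <- (x - mean(x)) / sd(x)

# scale() returns a matrix; drop the dimensions for comparison
auto <- as.numeric(scale(x))

all.equal(manual, auto)  # TRUE
mean(auto)               # 0
sd(auto)                 # 1
```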

# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

# Create a categorical variable of the crime rate (from the scaled crime rate),
# using the quantiles as the break points

# summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
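The quantile-based binning can be checked on a toy vector: because quantile() returns the minimum, the three quartiles and the maximum, cut() with these five break points yields four roughly equal-sized classes (toy data, not the Boston crime rates):

```r
x <- 1:8
bins <- quantile(x)  # 0%, 25%, 50%, 75%, 100%

categories <- cut(x, breaks = bins, include.lowest = TRUE,
                  labels = c("low", "med_low", "med_high", "high"))
table(categories)
# low  med_low  med_high  high
#   2        2         2     2
```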

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

LDA on the train set

First, we fit the linear discriminant analysis on the train set by using the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables.

# fit the LDA model
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2475248 0.2623762 0.2400990 0.2500000 
## 
## Group means:
##                  zn      indus          chas        nox         rm
## low       0.8788254 -0.8448951 -0.1148450583 -0.8398171  0.3785072
## med_low  -0.1260205 -0.2804308 -0.0123318823 -0.5429258 -0.1937130
## med_high -0.3705437  0.2214765  0.2147348790  0.4301176  0.1339978
## high     -0.4872402  1.0171306  0.0005392655  1.0392317 -0.3320834
##                 age        dis        rad        tax     ptratio
## low      -0.8693389  0.8158541 -0.7039422 -0.7480676 -0.40857083
## med_low  -0.3059675  0.3171846 -0.5484874 -0.4384644 -0.01737183
## med_high  0.4507093 -0.3922718 -0.4253975 -0.3097610 -0.28565162
## high      0.8107304 -0.8563937  1.6379981  1.5139626  0.78062517
##                black       lstat        medv
## low       0.38342210 -0.75643701  0.46592858
## med_low   0.31542441 -0.10657829 -0.03228813
## med_high  0.07228859  0.07420836  0.21254762
## high     -0.81610733  0.85640624 -0.66778816
## 
## Coefficients of linear discriminants:
##                 LD1         LD2          LD3
## zn       0.10573624  0.63461545 -1.010077528
## indus    0.06209234 -0.20729719  0.044815610
## chas    -0.08090981 -0.03356836  0.102028809
## nox      0.35729403 -0.83893178 -1.242488243
## rm      -0.07760302 -0.15897687 -0.272421501
## age      0.17987257 -0.35317138 -0.003453979
## dis     -0.10081633 -0.35860069  0.168198613
## rad      3.48345792  0.91222385 -0.405563996
## tax     -0.05225204  0.01229171  0.943931560
## ptratio  0.10990165 -0.03769890 -0.276590261
## black   -0.11774456  0.03047403  0.056390236
## lstat    0.24770754 -0.38352779  0.301445533
## medv     0.21545997 -0.54100041 -0.088915056
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9542 0.0338 0.0121

We can see that the first linear discriminant (LD1) explains most of the differences between groups, as its proportion of trace is very large (0.9542).
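The proportion of trace is computed from the singular values stored in the fitted object (lda.fit$svd in MASS): each discriminant's share is its squared singular value divided by the sum of all squared singular values. A sketch with illustrative values, not the ones from lda.fit above:

```r
# illustrative singular values for three discriminants
svd_vals <- c(10, 2, 1)

# proportion of trace, as printed by MASS for an lda fit
prop_trace <- svd_vals^2 / sum(svd_vals^2)
round(prop_trace, 4)
# 0.9524 0.0381 0.0095
```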

We can see this more clearly by plotting the results:

# create the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

The first linear discriminant (LD1) separates the high-crime areas from the rest, but cannot distinguish between the lower-crime neighborhoods. The second linear discriminant (LD2) explains some of the differences among the low-crime neighborhoods, with lower values of LD2 associated with higher crime levels.

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

summary(test)
##        zn              indus               chas         
##  Min.   :-0.4872   Min.   :-1.55630   Min.   :-0.27233  
##  1st Qu.:-0.4872   1st Qu.:-0.91238   1st Qu.:-0.27233  
##  Median :-0.4872   Median :-0.40622   Median :-0.27233  
##  Mean   : 0.1042   Mean   :-0.09802   Mean   :-0.07933  
##  3rd Qu.: 0.4025   3rd Qu.: 1.01499   3rd Qu.:-0.27233  
##  Max.   : 3.5861   Max.   : 2.11552   Max.   : 3.66477  
##       nox                 rm                age          
##  Min.   :-1.46443   Min.   :-3.44659   Min.   :-2.20524  
##  1st Qu.:-0.94233   1st Qu.:-0.49406   1st Qu.:-0.92810  
##  Median :-0.20448   Median : 0.03112   Median : 0.25135  
##  Mean   :-0.05051   Mean   : 0.03162   Mean   :-0.06114  
##  3rd Qu.: 0.64339   3rd Qu.: 0.57409   3rd Qu.: 0.82419  
##  Max.   : 2.72965   Max.   : 2.92103   Max.   : 1.11639  
##       dis                rad                tax          
##  Min.   :-1.16357   Min.   :-0.98187   Min.   :-1.30676  
##  1st Qu.:-0.70621   1st Qu.:-0.63733   1st Qu.:-0.71638  
##  Median :-0.15985   Median :-0.52248   Median :-0.55321  
##  Mean   : 0.09156   Mean   : 0.04274   Mean   :-0.01548  
##  3rd Qu.: 0.79494   3rd Qu.: 1.65960   3rd Qu.: 1.52941  
##  Max.   : 2.57768   Max.   : 1.65960   Max.   : 1.52941  
##     ptratio             black              lstat         
##  Min.   :-2.51994   Min.   :-3.59148   Min.   :-1.42599  
##  1st Qu.:-0.83399   1st Qu.: 0.18282   1st Qu.:-0.75662  
##  Median :-0.02565   Median : 0.37884   Median :-0.32601  
##  Mean   :-0.08271   Mean   : 0.03566   Mean   :-0.06622  
##  3rd Qu.: 0.80578   3rd Qu.: 0.42720   3rd Qu.: 0.42143  
##  Max.   : 1.26768   Max.   : 0.44062   Max.   : 3.40663  
##       medv         
##  Min.   :-1.68888  
##  1st Qu.:-0.47926  
##  Median :-0.06337  
##  Mean   : 0.03587  
##  3rd Qu.: 0.43679  
##  Max.   : 2.98650
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       18       8        1    0
##   med_low    5      13        2    0
##   med_high   0      15       12    2
##   high       0       0        0   26

The results in the table above show that the model predicts the high-crime areas very well (all 26 correct), whereas the low, medium-low and medium-high categories are harder to separate: a notable share of the low observations are classified as med_low, and more than half of the med_high observations are misclassified, mostly as med_low.
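The overall accuracy is the share of observations on the diagonal of the cross-tabulation. Using the counts printed above (note that the train/test split is random, so a rerun gives slightly different numbers):

```r
# confusion matrix copied from the output above (rows = correct, cols = predicted)
classes <- c("low", "med_low", "med_high", "high")
conf <- matrix(c(18,  8,  1,  0,
                  5, 13,  2,  0,
                  0, 15, 12,  2,
                  0,  0,  0, 26),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct = classes, predicted = classes))

# accuracy = correctly classified / total
accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy, 3)
# 0.676
```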

K-means

#Reload the data from the MASS package
library(MASS)
data('Boston')

#Scale the data
boston_scaled <- scale(Boston)

Next, we calculate the distances between observations:

# euclidean distance matrix
dist_eu <- dist(boston_scaled)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled, method = 'manhattan')

# look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618

We can see that the Euclidean distances are shorter than the Manhattan distances: the Euclidean distance is the straight-line (shortest) distance between two observations, whereas the Manhattan distance is measured along the coordinate axes at right angles, so it can never be shorter.
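The difference is easy to see with two points in the plane, say (0, 0) and (3, 4): the Euclidean distance is the hypotenuse, while the Manhattan distance walks along the axes.

```r
pts <- rbind(c(0, 0), c(3, 4))

dist(pts)                        # Euclidean: sqrt(3^2 + 4^2) = 5
dist(pts, method = "manhattan")  # Manhattan: |3| + |4| = 7
```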

Next, we look at the results from k-means clustering, first determining a suitable number of clusters:

set.seed(123) # Fix random number generator

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

Because the total within-cluster sum of squares drops fastest between one and two clusters, we perform k-means clustering with two centers.

# k-means clustering
km <- kmeans(boston_scaled, centers = 2)

Then we plot the original dataset with the clusters

# plot the original Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

# k-means clustering with four centers
km <- kmeans(boston_scaled, centers = 4)
summary(km)
##              Length Class  Mode   
## cluster      506    -none- numeric
## centers       56    -none- numeric
## totss          1    -none- numeric
## withinss       4    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           4    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric
# plot the Boston dataset with clusters
pairs(boston_scaled[, 1:6], col = km$cluster)

# plot the Boston dataset with clusters
pairs(boston_scaled[, 7:14], col = km$cluster)

Bonus

Next, we perform linear discriminant analysis on the Boston data, using the clusters from k-means as the target variable.

# Read data
data(Boston)
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7
# Scale data
boston_scaled <- scale(Boston)

# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)


# k-means with four centers
km1 <- kmeans(boston_scaled, centers = 4)

lda.fit1 <- lda(km1$cluster ~ ., data = boston_scaled)


# target classes as numeric
classes <- as.numeric(km1$cluster)

# plot the lda results
plot(lda.fit1, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit1, myscale = 1)

We can see that the linear discriminants are useful for separating the different clusters along the first two dimensions, because the clusters are quite distinct from one another.

Super Bonus

We use the plotly package to create a 3D plot:

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')